Using Word-Sense Disambiguation Methods to Classify Web Queries by Intent
Authors
Abstract
Three methods are proposed to classify queries by intent (CQI), e.g., navigational, informational, or commercial. Following mixed-initiative dialog systems, search engines should distinguish navigational queries, where the user is taking the initiative, from other queries, where there are more opportunities for system initiatives (e.g., suggestions, ads). The query intent problem has a number of useful applications for search engines, affecting how many (if any) advertisements to display, which results to return, and how to arrange the results page. Click logs are used as a substitute for annotation. Clicks on ads are evidence for commercial intent; other types of clicks are evidence for other intents. We start with a simple Naïve Bayes baseline that works well when there is plenty of training data. When training data is less plentiful, we back off to nearby URLs in a click graph, using a method similar to Word-Sense Disambiguation. Thus, we can infer that designer trench is commercial because it is close to www.saksfifthavenue.com, which is known to be commercial. The baseline method was designed for precision and the backoff method was designed for recall. Both methods are fast and do not require crawling webpages. We recommend a third method, a hybrid of the two, that does no harm when there is plenty of training data, and generalizes better when there isn’t, as a strong baseline for the CQI task.
1 Classify Queries By Intent (CQI)
Determining query intent is an important problem for today’s search engines. Queries are short (consisting of 2.2 terms on average (Beitzel et al., 2004)) and contain ambiguous terms. Search engines need to derive what users want from this limited source of information. Users may be searching for a specific page, browsing for information, or trying to buy something. Guessing the correct intent is important for returning relevant items.
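The three methods named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy click log, the add-one smoothing, the majority vote over neighboring URLs, and the `min_count` threshold are all assumptions made for illustration, and the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy click log: (query, clicked_url, was_ad_click).
# An ad click is treated as evidence of commercial intent, as in the abstract.
LOG = [
    ("trench coat sale", "www.saksfifthavenue.com", True),
    ("buy trench coat", "www.saksfifthavenue.com", True),
    ("trench warfare history", "en.wikipedia.org", False),
    ("world war I trench", "en.wikipedia.org", False),
]


class IntentClassifier:
    def __init__(self, log, min_count=2):
        self.min_count = min_count                # assumed threshold for "plenty of data"
        self.label_counts = Counter()             # label -> number of clicks
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.url_labels = defaultdict(Counter)    # url -> label -> count
        self.vocab = set()
        for query, url, was_ad_click in log:
            label = "commercial" if was_ad_click else "other"
            self.label_counts[label] += 1
            self.url_labels[url][label] += 1
            for word in query.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def naive_bayes(self, query):
        """Baseline: multinomial Naive Bayes over query words (add-one smoothing)."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for word in query.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def backoff(self, clicked_urls):
        """Backoff: vote with the labels of URLs adjacent to the query in the click graph."""
        votes = Counter()
        for url in clicked_urls:
            votes.update(self.url_labels.get(url, {}))
        return votes.most_common(1)[0][0] if votes else None

    def hybrid(self, query, clicked_urls):
        """Use the baseline when every query word is well attested, else back off."""
        words = query.lower().split()
        well_attested = all(
            sum(self.word_counts[label][w] for label in self.label_counts) >= self.min_count
            for w in words
        )
        if well_attested:
            return self.naive_bayes(query)
        return self.backoff(clicked_urls) or self.naive_bayes(query)


clf = IntentClassifier(LOG)
print(clf.hybrid("designer trench", ["www.saksfifthavenue.com"]))
```

In this toy setting the word "designer" never appears in the training log, so the hybrid falls back to the click graph and inherits the label of the neighboring URL. A query made only of well-attested words is handled by the baseline alone.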
Someone searching for designer trench is likely to be interested in results or ads for trench coats, while someone searching for world war I trench might be irritated by irrelevant clothing advertisements.
Broder (2002) and Rose and Levinson (2004) categorized queries into those with navigational, informational, and transactional or resource-seeking intent. Navigational queries are queries for which a user has a particular web page in mind that they are trying to navigate to, such as greyhound bus. Informational queries are those like San Francisco, in which the user is trying to gather information about a topic. Transactional queries are those like digital camera or download adobe reader, where the user is seeking to make a transaction or access an online resource.
Knowing the intent of a query greatly affects the type of results that are relevant. For many queries, Wikipedia articles are returned on the first page of results. For informational queries, this is usually appropriate, as a Wikipedia article contains summaries of topics and links to explore further. However, for navigational or transactional queries, Wikipedia is not as appropriate. A user looking for the greyhound bus homepage is probably not interested in facts about the company. Similarly, someone looking to download adobe reader will not be interested in Wikipedia’s description of the product’s history. Conversely, for informational queries, Wikipedia articles tend to be appropriate while advertisements are not. The user searching for world war I trench might find the Wikipedia article on trench warfare useful, while he is prob-
Similar resources
Understanding Users Intent by Deducing Domain Knowledge Hidden in Web Search Query Keywords
Search engines are used by people on a daily basis to retrieve information from the web. When an ambiguous word is present in a query, the specific sense of the keyword is not considered during the search process. Search engines return a large number of web pages as results from all the possible contexts. Users tend to browse only a few pages. Improving quality of retrieved results is a challenge and...
Using crowd-sourcing for query classification and analysis
To gain a better understanding of users and the intent behind their web searching activities, the first steps involve analysis of the query submitted by the user and correct categorical classification of the query as an input for further analyses. Natural Language Processing (NLP) is an area with many inaccuracies for problem areas in Word Sense Disambiguation (WSD) and terms detected that ...
A Genetic Fuzzy Semantic Web Search Agent Using Granular Semantic Trees for Ambiguous Queries
For most Web searching applications, queries are commonly ambiguous because words or phrases have different linguistic meanings for different Web users. Conventional keyword-based search engines cannot disambiguate queries to provide relevant results matching Web users’ intents. Traditional Word Sense Disambiguation (WSD) methods use statistical models or ontology-based knowledge systems to m...
Ontology Based Query Expansion Using Word Sense Disambiguation
The existing information retrieval techniques do not consider the context of the keywords present in the user’s queries. Therefore, the search engines sometimes do not provide sufficient information to the users. New methods based on the semantics of user keywords must be developed to search in the vast web space without incurring loss of information. The semantic based information retrieval te...
A Highest Sense Count Based Method for Disambiguation of Web Queries for Hindi Language Web Information Retrieval
Ambiguity in word senses has been recognized as a major challenge for information retrieval systems. Hindi language web information retrieval, like that of other languages, faces the problem of sense ambiguity. The sense ambiguity problem degrades the performance of every natural language processing (NLP) application. The performance of Hindi language web information retrieval is also affec...